This course is an introductory Data Science course covering data analysis with R.
We use R, RStudio, GitHub and related tools.
Link : https://potdarswapnil.github.io/IODS-project
Data source: an international survey of students enrolled in Introduction to Social Statistics in fall 2014. Data collection timeframe: 3.12.2014 - 10.01.2015. Data creation date: 14.01.2015. Sample size: 183 subjects with 63 variables. The data set was filtered to the columns of interest and to subjects with points > 0, yielding 166 subjects and 7 variables.
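The filtering step described above can be sketched as follows. This is only an illustration: `full_data` stands for the raw survey data (183 rows, 63 variables), and the column names are assumptions, not necessarily the actual variable names in the raw file.

```r
# Sketch of the filtering described above; 'full_data' is the assumed raw
# survey data frame with 183 rows and 63 variables.
keep <- c("gender", "age", "attitude", "deep", "stra", "surf", "points")
learnings <- full_data[, keep]
# keep only subjects with a positive exam score
learnings <- learnings[learnings$points > 0, ]
dim(learnings)  # should give 166 rows and 7 columns
```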
learnings <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = T)
dim(learnings)
## [1] 166 7
head(learnings)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str(learnings)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
library(GGally)
## Warning: package 'GGally' was built under R version 3.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(progress)
## Warning: package 'progress' was built under R version 3.3.2
plot_data <- ggpairs(learnings, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
print(plot_data)
summary(learnings)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
a. The bar chart shows nearly twice as many females as males. b. Attitude is clearly higher in males. c. Deep and surface learning are negatively correlated for both males and females. d. For males, age correlates negatively with attitude, deep, surface and points. e. For females, age correlates negatively with surface and points. f. Points correlate most strongly with attitude (cor = 0.437), then with stra (cor = 0.146). g. The age histograms are skewed for both males and females, indicating a predominantly young population.
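The correlations cited above can be checked directly from the data, for example:

```r
# overall correlation between exam points and attitude (reported as 0.437)
cor(learnings$points, learnings$attitude)
# correlations of age with points, separately for each gender
by(learnings, learnings$gender, function(d) cor(d$age, d$points))
```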
regression_model <- lm(points ~ attitude + stra + surf, data = learnings)
summary(regression_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learnings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this multivariable model we explain the variable points with the explanatory variables attitude, stra and surf. Based on the regression summary, points has a significant relationship with attitude, while stra and surf show no significant relationship with points. The R-squared value of 0.21 implies that the model explains roughly one fifth of the variation in the outcome.
new_regression_model <- lm(points ~ attitude, data = learnings)
summary(new_regression_model)
##
## Call:
## lm(formula = points ~ attitude, data = learnings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
This univariate model shows that points is significantly related to attitude (Multiple R-squared: 0.1906). R-squared = explained variation / total variation. R-squared always lies between 0% and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that it explains all of it. Multiple R-squared is used to evaluate how well the model fits the data. Here the R-squared value of 0.19 implies that the model explains only about 19% of the variation in the outcome.
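The R-squared definition above can be verified by hand against the model summary:

```r
# R-squared = explained variation / total variation
ss_res <- sum(residuals(new_regression_model)^2)         # unexplained variation
ss_tot <- sum((learnings$points - mean(learnings$points))^2)
1 - ss_res / ss_tot                       # manual R-squared
summary(new_regression_model)$r.squared   # reported Multiple R-squared: 0.1906
```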
my_model2 <- lm(points ~ attitude + stra, data = learnings)
# draw diagnostic plots using the plot() function. Choose the plots 1(Residuals vs Fitted values), 2(Normal QQ-plot) and 5 (Residuals vs Leverage)
par(mfrow = c(2,2))
plot(new_regression_model, which = c(1, 2, 5))
Assumptions of the model: a. The model describes the variables we are interested in well. b. Linearity: the target variable is modelled as a linear combination of the model parameters. c. The errors are normally distributed, uncorrelated and have constant variance. The Residuals vs Fitted plot examines the constant-variance assumption; some residuals deviate noticeably from the horizontal line, suggesting a possible issue with the model. The Q-Q plot shows that most points fall close to the line, with some departures on the left-hand side, so the fit is reasonably close to the normality assumption; the model is reasonably okay. The Residuals vs Leverage plot shows the impact of single observations on the model; a few observations (standardized residuals around -3) have a relatively high impact, which is not ideal.
The alcohol consumption data are read from the location shown in the code below.
Citation: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Changes in the dataset:
*The variables not used for joining the two data sets have been combined by averaging (including the grade variables) *'alc_use' is the average of 'Dalc' and 'Walc' *'high_use' is TRUE if 'alc_use' is higher than 2 and FALSE otherwise
The data are read using the following command:
##Load required libraries
library(GGally)
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.1
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.3.2
#read the data in the file produced at the end of the assignment
alc<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",",header=TRUE)
dim(alc)
## [1] 382 35
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
head(alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 3
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1 1 1 3 6 5 6 6 1.0 FALSE
## 2 1 1 3 4 5 5 6 1.0 FALSE
## 3 2 3 3 10 7 8 10 2.5 TRUE
## 4 1 1 5 2 15 14 15 1.0 FALSE
## 5 1 2 5 4 6 10 10 1.5 FALSE
## 6 1 2 5 10 15 15 15 1.5 FALSE
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data.
Numerous parameters are affected by alcohol consumption. To analyse and hypothesize about different relationships, one can study the following parameters.
A box plot (or boxplot) is a convenient way of graphically depicting groups of numerical data through their quartiles. The box spans the first to the third quartile, the line shows the median, and the "whiskers" above and below the box extend (by default) to the most extreme values within 1.5 times the interquartile range. Points plotted outside the whiskers are outliers.
g1 <- ggplot(alc, aes(x = high_use, y = G3, col=sex)) + ggtitle("Impact of Alchohol Consumption on grades")
g1 + geom_boxplot() + ylab("grade")
We can immediately see that with low alcohol use, males and females have nearly identical median grades, with males slightly higher. With high alcohol use, males are affected more. It is also interesting that the females' median grade does not change much, although the upper grade range comes down. The male group also shows a number of low outliers.
g2 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + ggtitle("Impact of Alchohol Consumption on absences")
g2 + geom_boxplot() + ylab("Absences")
Again males are worse affected: the number of absences increases and the median goes up. With high use, absences nearly double for males, and there is a substantial increase for females as well.
g3 <- ggplot(alc, aes(x = high_use, y = health, col=sex)) + ggtitle("Impact of Alchohol Consumption on Student grades and Health")
g3 + geom_boxplot() + ylab("Health") + scale_y_reverse()
Apparently males' health does not suffer much with high consumption, whereas females appear more sensitive to higher alcohol consumption.
g4 <- ggplot(alc, aes(x = high_use, y = age, col=sex)) + ggtitle("Student absences by alcohol consumption and age")
g4 + geom_boxplot() + ylab("Age")
For high consumption, the median age is 17 for males and 16.5 for females. Males start earlier with low consumption, while females start about a year later.
While this is not the entire set of variables, these factors are indicative of the impact of alcohol consumption on students.
Bar plots are one of the most commonly used kind of data visualization. They are used to display numeric values (on the y-axis), for different categories (on the x-axis).
bp <- ggplot(alc, aes(health, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Barplot of health for high and low alcohol usage")
bp
bp2 <- ggplot(alc, aes(sex, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Barplot of sex counts for high and low alcohol usage")
bp2
bp3 <- ggplot(alc, aes(absences, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Barplot of absence counts for high and low alcohol usage")
bp3
bp4 <- ggplot(alc, aes(G3, fill=high_use)) + geom_bar(position="dodge") +
ggtitle("Barplot of grades counts for high and low alcohol usage")
bp4
The bar plots confirm the findings of the box plots in a different way.
b3 <- ggplot(alc, aes(x=high_use, y=absences, fill=health)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Student absences by alcohol consumption and health")
# draw the plot
b3
Cross tabulations are tables in matrix format that display the (multivariate) frequency distribution of the variables. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
library(descr)
## Warning: package 'descr' was built under R version 3.3.2
CrossTable(alc$health,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ===================================
## alc$high_use
## alc$health FALSE TRUE Total
## -----------------------------------
## 1 35 11 46
## 0.190 0.459
## 0.761 0.239 0.120
## 0.130 0.098
## 0.092 0.029
## -----------------------------------
## 2 29 14 43
## 0.064 0.154
## 0.674 0.326 0.113
## 0.107 0.125
## 0.076 0.037
## -----------------------------------
## 3 63 20 83
## 0.320 0.772
## 0.759 0.241 0.217
## 0.233 0.179
## 0.165 0.052
## -----------------------------------
## 4 47 17 64
## 0.069 0.166
## 0.734 0.266 0.168
## 0.174 0.152
## 0.123 0.045
## -----------------------------------
## 5 96 50 146
## 0.501 1.209
## 0.658 0.342 0.382
## 0.356 0.446
## 0.251 0.131
## -----------------------------------
## Total 270 112 382
## 0.707 0.293
## ===================================
CrossTable(alc$sex,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ================================
## alc$high_use
## alc$sex FALSE TRUE Total
## --------------------------------
## F 157 41 198
## 2.078 5.009
## 0.793 0.207 0.518
## 0.581 0.366
## 0.411 0.107
## --------------------------------
## M 113 71 184
## 2.236 5.390
## 0.614 0.386 0.482
## 0.419 0.634
## 0.296 0.186
## --------------------------------
## Total 270 112 382
## 0.707 0.293
## ================================
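The same margins can also be reproduced with base R, without the descr package; a sketch using table() and prop.table():

```r
tab <- table(sex = alc$sex, high_use = alc$high_use)
tab                            # cell counts (N)
prop.table(tab, margin = 1)    # N / Row Total
prop.table(tab, margin = 2)    # N / Col Total
prop.table(tab)                # N / Table Total
```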
We model the variable high_use as a function of chosen explanatory variables using logistic regression.
Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary coded (0,1–failure, success) data. When the response is a binary (dichotomous) variable, and x is numeric, logistic regression fits a logistic curve to the relationship between x and y.
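The logistic curve that glm() fits can be sketched directly; the range of the linear predictor below is an arbitrary choice for illustration:

```r
# logistic function: p = 1 / (1 + exp(-x))
curve(1 / (1 + exp(-x)), from = -6, to = 6,
      xlab = "linear predictor", ylab = "probability of success")
```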
We treat the students' health as a factor variable (it takes values 1-5). The model and results (summary, coefficients and confidence intervals) are obtained with the following functions:
alc$health <- factor(alc$health)
m <- glm(high_use ~ G3 + absences+sex+health, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + sex + health, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8620 -0.8454 -0.6204 1.0572 2.0505
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68344 0.49385 -3.409 0.000652 ***
## G3 -0.03396 0.02591 -1.311 0.189930
## absences 0.07652 0.01854 4.128 3.66e-05 ***
## sexM 1.01459 0.24875 4.079 4.53e-05 ***
## health2 0.36768 0.50514 0.728 0.466686
## health3 -0.11382 0.46137 -0.247 0.805146
## health4 0.19310 0.47462 0.407 0.684118
## health5 0.32903 0.41962 0.784 0.432968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 421.53 on 374 degrees of freedom
## AIC: 437.53
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) G3 absences sexM health2 health3
## -1.68343883 -0.03395765 0.07651849 1.01458713 0.36768133 -0.11381629
## health4 health5
## 0.19310111 0.32903488
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.69205669 -0.74729058
## G3 -0.08465022 0.01723061
## absences 0.04227944 0.11462062
## sexM 0.53341598 1.51054655
## health2 -0.61775150 1.37599976
## health3 -1.00690619 0.81490637
## health4 -0.72685195 1.14700931
## health5 -0.47001929 1.18784390
confint.default(m)
## 2.5 % 97.5 %
## (Intercept) -2.65136084 -0.7155168
## G3 -0.08473311 0.0168178
## absences 0.04018536 0.1128516
## sexM 0.52704638 1.5021279
## health2 -0.62237191 1.3577346
## health3 -1.01808292 0.7904503
## health4 -0.73714564 1.1233479
## health5 -0.49340847 1.1514782
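The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. A short sketch, reusing the model m fitted above:

```r
# odds ratios and their 95% confidence intervals
OR <- exp(coef(m))
CI <- exp(confint(m))
cbind(OR, CI)
```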
# Load the required (usual) libraries
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(reshape2)
library(ggplot2)
library(GGally)
library(dplyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Warning: package 'tibble' was built under R version 3.3.1
## Warning: package 'tidyr' was built under R version 3.3.1
## Warning: package 'readr' was built under R version 3.3.2
## Warning: package 'purrr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
## select(): dplyr, MASS
#load the Boston Data
data("Boston")
Load the Boston data from the MASS package in R. The data set is described in the R documentation (?Boston). It contains real estate details from the Boston suburban area.
We will focus on the Crime rate and look at the data available.
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
pairs(Boston)
ggpairs(Boston, diag = list(continuous = "density", discrete = "bar"), axisLabels = "show")
cor_matrix<-cor(Boston) %>% round(digits =2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
In this upper triangular correlation plot, the size and colour of each circle represent the correlation value. For example, rad and tax (0.91), indus and nox (0.76), and age and nox (0.73) are strongly positively correlated, while dis is strongly negatively correlated with nox, age and indus, and lstat with medv.
In scaling we subtract the column mean from each column and divide the difference by the standard deviation:

scaled(x) = (x - mean(x)) / sd(x)

We can use the scale() function to scale the Boston data set.
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Scaling makes variables with different ranges comparable. We can see that all the columns are now on the same scale and every column has mean zero.
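We can confirm that scale() implements the formula above by recomputing one column by hand:

```r
# manual standardization of the crime rate column
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(boston_scaled[, "crim"]), manual_crim)  # TRUE
```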
Linear discriminant analysis (LDA) is a classification (and dimension reduction) method. It finds the linear combinations of the variables that best separate the target variable classes. The target can be a binary or multiclass variable.
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
We look at the table of the categorical variable to see how many observations fall in each class:
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Next we drop the old crime rate variable from the data set and add the new categorical variable to the scaled data:
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
We divide the data into a training set and a test set with an 80/20 split: 80% of the rows are chosen at random for the training set, and the rest of the data are used for testing.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
We fit a linear discriminant model on the training data:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2425743 0.2425743 0.2500000
##
## Group means:
## zn indus chas nox rm
## low 0.9948261 -0.9102353 -0.1619430798 -0.8680255 0.4501078
## med_low -0.1057212 -0.2932917 0.0490668699 -0.5383154 -0.1481657
## med_high -0.3629840 0.1086630 0.2499393309 0.3383128 0.2027381
## high -0.4872402 1.0171306 0.0005392655 1.0869946 -0.3343944
## age dis rad tax ptratio
## low -0.8863805 0.8743938 -0.6813378 -0.7477415 -0.47287929
## med_low -0.3389198 0.3467841 -0.5494382 -0.4719630 -0.07938312
## med_high 0.3869949 -0.3351426 -0.4228725 -0.3522049 -0.30892182
## high 0.8185038 -0.8493727 1.6379981 1.5139626 0.78062517
## black lstat medv
## low 0.3787705 -0.77991463 0.51874285
## med_low 0.3086226 -0.11014226 -0.01909987
## med_high 0.1748623 -0.04053904 0.26126789
## high -0.7521984 0.81974752 -0.66369733
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09310096 0.77744697 -0.8969920
## indus 0.03118109 -0.21992547 0.1738629
## chas -0.06473396 -0.08689175 0.1641463
## nox 0.34549709 -0.71277295 -1.3800802
## rm -0.10618646 -0.12837668 -0.1778779
## age 0.33455136 -0.31064648 -0.2166690
## dis 0.02064713 -0.33956208 0.1575325
## rad 3.34237924 0.96988711 -0.1171776
## tax 0.07332402 -0.06865271 0.6919428
## ptratio 0.12385520 0.03295521 -0.2491231
## black -0.12208325 0.01136212 0.1015168
## lstat 0.25277176 -0.32643400 0.5019637
## medv 0.21950484 -0.44183300 -0.1543635
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9526 0.0349 0.0125
Next we visualize the results using a biplot. The argument dimen of the plot() function determines how many discriminants are used.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
We use the LDA model to predict classes for the test data. To analyze model performance we cross tabulate the correct and predicted classes.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = test$crime, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 9 1 0
## med_low 4 18 6 0
## med_high 0 10 16 2
## high 0 0 0 26
The table() function produces a confusion matrix from which we can determine how many observations were correctly or incorrectly classified.
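From the confusion matrix we can also compute the overall accuracy, i.e. the share of correctly classified test observations:

```r
conf <- table(correct = test$crime, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)         # proportion classified correctly
mean(test$crime == lda.pred$class)  # the same number, computed directly
# for the table shown above this is (10 + 18 + 16 + 26) / 102, about 0.69
```

Note that without a fixed random seed the train/test split, and hence this accuracy, varies from run to run.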
dist_eu <- dist(boston_scaled, method = "euclidean")
## Warning in dist(boston_scaled, method = "euclidean"): NAs introduced by
## coercion
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5270 4.9080 4.9390 6.2420 13.0000
K-means (kmeans()) is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects. It takes the number of clusters to look for as an argument:
km <- kmeans(dist_eu, centers = 4)
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. The optimal number of clusters is where the WCSS drops sharply.
The total within sum of squares, twcss, is computed below. K-means assigns the initial cluster centers randomly, so we fix the random seed with set.seed() to make the result reproducible; without it, reruns may generate minor differences in the plots.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
km <- kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
The plot shows that twcss drops sharply after k = 2, so in the last graphic we run kmeans() with 2 clusters and plot the result, which is easier to display.
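The elbow heuristic above relies on the tot.withinss component returned by kmeans(). As a quick sanity check (a minimal sketch on simulated data, not part of the Boston analysis), tot.withinss is simply the sum of the per-cluster withinss values:

```r
# Minimal sketch on simulated data: tot.withinss equals sum(withinss).
set.seed(123)
x <- matrix(rnorm(100), ncol = 2)   # 50 points in 2 dimensions
km_demo <- kmeans(x, centers = 3)
all.equal(km_demo$tot.withinss, sum(km_demo$withinss))   # TRUE
```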
set.seed(123)
data("Boston")
boston_scaled2 <- as.data.frame(scale(Boston, center = TRUE, scale = TRUE))
dist_eu2 <- dist(boston_scaled2, method = "euclidean")
km2 <- kmeans(dist_eu2, centers = 5)
boston_scaled2$cluster <- km2$cluster
lda.fit2 <- lda(cluster ~ ., data =boston_scaled2)
plot(lda.fit2, col=as.numeric(boston_scaled2$cluster), dimen=2)
lda.arrows(lda.fit2, myscale = 3, col = "#666666")
library(plotly)
## Warning: package 'plotly' was built under R version 3.3.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
train$cluster <- boston_scaled2$cluster[match(rownames(train), rownames(boston_scaled2))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cluster))
library(GGally)
library(corrplot)
library(tidyr)
library(dplyr)
human <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header = TRUE)
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The human dataset contains various indicators of the well-being of various countries. There are altogether 155 observations (i.e. countries) and these 8 variables:
* Edu2.FM: the ratio of females to males in secondary education
* Labo.FM: the ratio of females to males in the labour force
* Edu.Exp: expected number of years spent in schooling
* Life.Exp: life expectancy in years
* GNI: gross national income
* Mat.Mor: the relative number of mothers who die in childbirth (maternal mortality)
* Ado.Birth: the rate of teenage pregnancies leading to child birth
* Parli.F: the percentage of female parliamentarians
#p <- ggpairs(human, mapping = aes(col="blue", alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
#cor(human) %>% round(digits = 2) %>% corrplot.mixed(tl.cex = 0.7, number.cex = 0.7, order = "hclust", addrect = 2)
cor_matrix <- cor(human) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
* Edu.Exp is approximately normally distributed
* Edu.Exp is negatively correlated with Ado.Birth
* Life.Exp is highly correlated with GNI and Edu.Exp
* Life.Exp is negatively correlated with Mat.Mor
* Parli.F does not correlate much with the other variables, but it is correlated with the female-to-male labour ratio (Labo.FM)
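Such observations can also be extracted programmatically from the correlation matrix. A minimal sketch using the built-in mtcars data (not the human data), listing the variable pairs whose absolute correlation exceeds a threshold:

```r
# Minimal sketch on the built-in mtcars data (not the human data):
# list variable pairs whose absolute correlation exceeds a threshold.
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
strong <- which(abs(cm) > 0.85, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r    = round(cm[strong], 2))
```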
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.5,0.5), col = c("orange", "purple"), main="Biplot of the first two principal components for the unscaled data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
From the biplot of the unscaled data:
* only the GNI arrow is clearly visible, because GNI has by far the largest variance in the unscaled data and so dominates the first principal component
* the arrows of the other variables are negligible in comparison
* Qatar seems to be an outlier
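This illustrates that PCA is not scale-invariant: a variable measured on a large scale dominates the components. A minimal sketch on the built-in USArrests data (used here only as a small stand-in, not part of the human analysis):

```r
# Minimal sketch on the built-in USArrests data (not the human data):
# without scaling, the large-variance variable dominates the first PC.
apply(USArrests, 2, var)                       # Assault has by far the largest variance
p_raw    <- prcomp(USArrests)                  # unscaled
p_scaled <- prcomp(USArrests, scale. = TRUE)   # standardized
# proportion of variance explained by PC1
summary(p_raw)$importance[2, 1]      # ~0.97: PC1 is essentially Assault alone
summary(p_scaled)$importance[2, 1]   # ~0.62: variance is shared more evenly
```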
**Let's scale the data for better interpretation**
human_scale <- scale(human)
summary(human_scale)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
**Check PCA after scaling**
pca_human <- prcomp(human_scale)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("orange", "purple"), main="Biplot of the first two principal components for the scaled data")
After standardization all variables contribute on a comparable scale, so the biplot now shows the individual countries spread out, separated according to their correlations with the principal components.
**Observations**
* PC1 has a high correlation with Mat.Mor
* PC2 has a high correlation with Parli.F
* Mat.Mor, Ado.Birth, Edu.Exp, Edu2.FM, Life.Exp and GNI are strongly correlated with each other
* Life.Exp, GNI, Edu.Exp and Edu2.FM point in the opposite direction from Mat.Mor and Ado.Birth, which indicates the importance of expected education and life expectancy
* Female representation in parliament is associated with high GNI, high life expectancy and a greater female-to-male labour ratio
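These directions can also be read numerically from the loadings, i.e. the rotation matrix returned by prcomp(). A minimal sketch, again using the built-in USArrests data as a stand-in for the human data:

```r
# Minimal sketch on USArrests (as a stand-in for the human data):
# the rotation matrix gives each variable's loading on each PC.
p <- prcomp(USArrests, scale. = TRUE)
round(p$rotation[, 1:2], 2)
# Variables with large |loadings| on PC1 drive the horizontal axis of the
# biplot; variables with large |loadings| on PC2 drive the vertical axis.
```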
In the case of categorical variables, dimensionality reduction can be performed with correspondence analysis (CA, n = 2 variables) or multiple correspondence analysis (MCA, n > 2). MCA is a generalization of PCA and an extension of CA. The analysis uses the cross-tabulation matrices between all variables in the data set, and the information from the cross-tabulations is presented in a clear graphical form. MCA can be used, for example, as a pre-processing step for clustering.
Here we use tea data from FactoMineR package.
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.3.2
library(tidyr)
library(ggplot2)
library(dplyr)
data("tea")
keep_columns <- c("Tea","price","How", "how", "sugar", "where","lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea price How
## black : 74 p_branded : 95 alone:195
## Earl Grey:193 p_cheap : 7 lemon: 33
## green : 33 p_private label: 21 milk : 63
## p_unknown : 12 other: 9
## p_upscale : 53
## p_variable :112
## how sugar where
## tea bag :170 No.sugar:155 chain store :192
## tea bag+unpackaged: 94 sugar :145 chain store+tea shop: 78
## unpackaged : 36 tea shop : 30
##
##
##
## lunch
## lunch : 44
## Not.lunch:256
##
##
##
##
str(tea_time)
## 'data.frame': 300 obs. of 7 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ price: Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
Selected Parameters : Tea, price, How, how, sugar, where and lunch.
Graphically the data are presented as bar plots using the ggplot() function.
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
* Very few people drink cheap tea
* Most people drink their tea alone, use tea bags, prefer the Earl Grey variety and buy it from chain stores
Next, the MCA analysis is performed on the chosen tea data.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.304 0.249 0.188 0.180 0.156 0.152
## % of var. 13.293 10.896 8.242 7.858 6.846 6.632
## Cumulative % of var. 13.293 24.189 32.431 40.289 47.136 53.767
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.148 0.140 0.130 0.124 0.118 0.101
## % of var. 6.463 6.128 5.691 5.441 5.151 4.428
## Cumulative % of var. 60.230 66.358 72.049 77.490 82.641 87.069
## Dim.13 Dim.14 Dim.15 Dim.16
## Variance 0.095 0.080 0.073 0.048
## % of var. 4.169 3.479 3.193 2.090
## Cumulative % of var. 91.238 94.717 97.910 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.466 0.238 0.050 | 0.477 0.305 0.053 | -0.312
## 2 | -0.192 0.040 0.024 | -0.024 0.001 0.000 | -0.633
## 3 | -0.293 0.094 0.116 | -0.032 0.001 0.001 | -0.142
## 4 | -0.410 0.185 0.221 | -0.003 0.000 0.000 | 0.228
## 5 | -0.293 0.094 0.116 | -0.032 0.001 0.001 | -0.142
## 6 | -0.486 0.260 0.099 | 0.348 0.162 0.050 | -0.096
## 7 | -0.293 0.094 0.116 | -0.032 0.001 0.001 | -0.142
## 8 | -0.192 0.040 0.024 | -0.024 0.001 0.000 | -0.633
## 9 | 0.569 0.356 0.150 | -0.552 0.407 0.141 | -0.121
## 10 | 0.812 0.723 0.321 | -0.401 0.216 0.078 | -0.636
## ctr cos2
## 1 0.172 0.023 |
## 2 0.708 0.256 |
## 3 0.036 0.027 |
## 4 0.092 0.068 |
## 5 0.036 0.027 |
## 6 0.016 0.004 |
## 7 0.036 0.027 |
## 8 0.708 0.256 |
## 9 0.026 0.007 |
## 10 0.715 0.196 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.442 2.263 0.064 4.370 | 0.094 0.125
## Earl Grey | -0.222 1.492 0.089 -5.158 | -0.182 1.225
## green | 0.309 0.492 0.012 1.876 | 0.855 4.614
## p_branded | -0.602 5.401 0.168 -7.090 | 0.410 3.056
## p_cheap | -0.440 0.212 0.005 -1.175 | 0.377 0.190
## p_private label | -0.717 1.692 0.039 -3.401 | 0.560 1.260
## p_unknown | -0.849 1.357 0.030 -2.998 | 0.635 0.925
## p_upscale | 1.555 20.082 0.519 12.454 | 0.472 2.253
## p_variable | 0.028 0.014 0.000 0.373 | -0.768 12.621
## alone | -0.017 0.008 0.001 -0.392 | 0.146 0.798
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.928 | -1.103 22.739 0.398 -10.909 |
## Earl Grey 0.060 -4.231 | 0.424 8.782 0.325 9.853 |
## green 0.090 5.198 | -0.009 0.001 0.000 -0.055 |
## p_branded 0.078 4.828 | -0.073 0.127 0.002 -0.856 |
## p_cheap 0.003 1.008 | -0.323 0.185 0.002 -0.863 |
## p_private label 0.024 2.658 | 0.206 0.226 0.003 0.978 |
## p_unknown 0.017 2.241 | -0.047 0.007 0.000 -0.165 |
## p_upscale 0.048 3.777 | -0.038 0.019 0.000 -0.305 |
## p_variable 0.351 -10.246 | 0.066 0.124 0.003 0.884 |
## alone 0.040 3.447 | -0.077 0.289 0.011 -1.804 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.090 0.104 0.416 |
## price | 0.612 0.354 0.009 |
## How | 0.055 0.087 0.392 |
## how | 0.619 0.492 0.013 |
## sugar | 0.051 0.003 0.316 |
## where | 0.700 0.604 0.059 |
## lunch | 0.000 0.099 0.114 |
plot(mca, invisible=c("ind"), habillage = "quali")